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Abstract. While existing mathematical descriptions can accurately account for phenomena at 
microscopic scales (e.g. molecular dynamics), these are often high-dimensional, stochastic and their 
applicability over macroscopic time scales of physical interest is computationally infeasible or imprac- 
tical. In complex systems, with limited physical insight on the coherent behavior of their constituents, 
the only available information is data obtained from simulations of the trajectories of huge numbers 
of degrees of freedom over microscopic time scales. This paper discusses a Bayesian approach to 
deriving probabilistic coarse-grained models that simultaneously address the problems of identifying 
appropriate reduced coordinates and the effective dynamics in this lower-dimensional representa- 
tion. At the core of the models proposed lie simple, low-dimensional dynamical systems which serve 
as the building blocks of the global model. These approximate the latent, generating sources and 
parameterize the reduced-order dynamics. We discuss parallelizable, online inference and learning 
algorithms that employ Sequential Monte Carlo samplers and scale linearly with the dimensionality 
of the observed dynamics. We propose a Bayesian adaptive time-integration scheme that utilizes 
probabilistic predictive estimates and enables rigorous concurrent s imulation over macroscopic time 
scales. The data-driven perspective advocated assimilates computational and experimental data and 
thus can materialize data-model fusion. It can deal with applications that lack a mathematical de- 
scription and where only observational data is available. Furthermore, it makes non-intrusive use of 
existing computational models. 
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1. Introduction. The present paper is concerned with the development of prob- 
abilistic coarse-grained models for high-dimensional dynamical systems with a view of 
enabling multiscalc simulation. We describe a unified treatment of complex problems 
described by large systems of deterministic or stochastic ODEs and/or large number 
of data streams. Such systems arise frequently in modern multi-physics applications 
either due to the discrete nature of the system (e.g. molecular dynamics) or due to 
discretization of spatiotemporal models (e.g. PDEs): 



where dim(y) >> 1 (e.g. R d , d >> 1). Stochastic versions are also frequently 
encountered: 



where u t is a driving stochastic process (i.e. Wiener process). Uncertainties could 
also appear in the initial conditions that accompany the aforementioned systems of 
equations. 
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Even though the numerical solution of (stochastic) ODEs is a well-studied sub- 
ject and pertinent computational libraries are quite mature, traditional schemes are 
impractical or infeasible for systems which are high-dimensional and exhibit a large 
disparity in scales. This is because most numerical integrators must use time-steps 
of the order of the fastest scales which precludes solutions over long time ranges that 
are of interest for physical and engineering purposes. In the context of atomistic 
simulations, practically relevant time scales exceed typical integration steps of ~ Ifs 
by several orders of magnitude [3J. Furthermore, when numerical solutions of tran- 
sient PDEs are sought, resolution and accuracy requirements give rise to systems with 
more than 10 9 degrees of freedom [1021 [22l 11281 [73] where the integration time steps 
are slaved by fast reaction rates or high oscillation frequencies. This impedes their 
solution and frequently constitutes computationally infeasible other important tasks 
such as stability analysis, sensitivity, design and control. 

Multiscale dynamical systems exist independently of the availability of mathe- 
matical models. Large numbers of time series appear in financial applications, meteo- 
rology, remote sensing where the phenomena of interest unfold also over a large range 
of time scales [1391 180] . A wealth of time series data is also available in experimental 
physics and engineering which by themselves or in combination with mathematical 
models can be useful in analyzing underlying phenomena [1061 1881 1108j by deriving 
reduced, predictive descriptions. 

Quite frequently the time evolution of all the observables is irrelevant for physical 
and practical purposes and the analysis is focused on a reduced set of variables or 
reaction coordinates y t = P(y t ) obtained by an appropriate mapping V : y — >• y. 
The goal is then to identify a closed, deterministic or stochastic system of equations 
with respect to y u e.g. : 

^ = Hy t ), vt^y (i-3) 

In the context of equilibrium theormodynamics where enmscmblc averages with re- 
spect to the invariant distribution of y t arc of interest, coarse-graining amounts to 
free-energy computations |25j . In the noncquilibrium case and when an invariant 
distribution exists, a general approach for deriving effective dynamics is based on 
Mori-Zwanzig projections [H3 EH H3 HE HH El- Other powerful numerical ap- 
proaches to identify the dynamical behavior with respect to the reduced coordinates 
include transition path sampling, the transfer operator approach, the nudged elas- 
tic band, the string method, Perron cluster analysis and spectral decompositions 
[3"9l SH g3J HH1 HOI 1105] . Marked efforts in chemical kinetics have led to an array of 
computational tools such as computational singular perturbation |981l99j . the intrinsic 
low-dimensional manifold approach [1041 1144] and others [120, 113,153. Notable suc- 
cesses in overcoming the timescale dilemma have also been achieved in the context of 
MD simulations [S3 [1341 H35J [132] (or Hamiltonian systems in general (TT2 l [TOO ] [127] ) . 

In several problems, physical or mathematical arguments have led analysts to 
identify a few, salient features and their inter-dependencies that macroscopically de- 
scribe the behavior of very complex systems consisting of a huge number of indi- 
viduals/agents/components/degrees of freedom. These variables parameterize a low- 
dimensional, attracting, invariant, "slow" manifold characterizing the long-term pro- 
cess dynamics |71j . Hence the apparent complexity exhibited in the high-dimensionality 
and the multiscale character of the original model is a pretext of a much simpler, latent 
structure that, if revealed, could make the aforementioned analysis tasks much more 
tractable. The emergence of macroscopic, coherent behavior has been the foundation 
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of coarse-grained dynamic models that have been successful in a wide range of ap- 
plications. The coarse-grained parameterization and associated model depend on the 
analysis objectives and particularly on the time scale one wishes to make predictions. 
Modern approaches with general applicability such as the Equation-free method |92j 
or Heterogeneous Multiscale Method (HeMM,[47]) are also based on the availability 
of reduced coordinates and in the case of HeMM of a macroscopic model which is 
informed and used in conjunction with the microscale model. 

Largely independently of the developments in the fields of computational physics 
and engineering, the problem of deriving, predictive reduced-order models for a large 
number of time series that potentially exhibit multiple scales has also been addressed 
in statistics and machine learning communities [1401 l53j with applications in network 
analysis [33J , environmetrics |141) , sensor network monitoring [1421 1117] , moving ob- 
ject tracking [4], financial data analysis [5], computer model emulation |103j . Signifi- 
cant advances have been achieved in modeling |131) , forecasting [143| and developing 
online, scalable algorithms (FJ 11261 [H2 [Ml [HU 1145) . that are frequently based on 
the discovery of hidden variables that provide insight to the intrinsic structure of 
streaming data [Ml ISO UM Q2S [HS] . 

The present paper proposes a data-driven alternative that is able to automat- 
ically coarse-grain high-dimensional systems without the need of preprocessing and 
availability of physical insight. The data is most commonly obtained by simulations 
of the most reliable, finest-scale (microscopic) model available. This is used to infer a 
lower-dimensional description that captures the dynamic evolution of the system at a 
coarser scale (i.e. a macroscopic model). The majority of available techniques address 
separately the problems of identifying appropriate reduced coordinates and the effec- 
tive dynamics in this lower-dimensional representation. It is easily understood that 
the solution of one affects the other. We propose a general framework where these 
two problems are simultaneously solved and coarse-grained models are built from the 
ground up. We propose procedures that concurrently infer the macroscopic dynamics 
and their mapping the high-dimensional, fine-scale description. As a result no errors 
or ambiguity are introduced when the fine-scale model needs to be reinitialized in 
order to obtain additional simulation data. To that end, we advocate a largely unex- 
plored in computational physics perspective based on the Bayesian paradigm which 
provides a rigorous foundation for learning from data. It is capable of quantifying 
inferrential uncertainties and, more importantly, uncertainty due to information loss 
in the coarse-graining process. 

We present a Baycsian state-space model where the reduced, coarse-grained dy- 
namics are parametrized by tractable, low-dimensional dynamical models. These can 
be viewed as experts offering opinions on the evolution of the high-dimensional ob- 
servables. Each of these modules could represent a single latent regime and would 
therefore be insufficient by itself to explain the whole system. As is often the case 
with real experts, their opinions are valid under very specific regimes. We propose 
therefore a framework for dynamically synthesizing such models in order to obtain 
an accurate global representation that retains its interpretability and computational 
tractability. 

An important contribution of the paper, particularly in view of enabling simula- 
tions of multiscale systems, is online inference algorithms based on Sequential Monte 
Carlo that scale linearly with the dimensionality of the observables d (Equation (|1.1|> ). 
These allow the recursive assimilation of data and re-calibration of the coarse-grained 
dynamics. The Baycsian framework adopted provides probabilistic predictive esti- 
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mates that can be employed in the context of adaptive time-integration. Rather than 
determining integration time steps based on traditional accuracy and stability met- 
rics, we propose using measures of the predictive uncertainty in order to decide how 
long into the future the coarse-grained model can be used. When the uncertainty 
associated with the predicitve estimates exceeds the analyst's tolerance, the fine-scale 
dynamics can be consistently re-initialized in order to obtain additional data that 
sequentially update the coarse-grained model. 

In agreement with some recently proposed methodologies [521 > the data-driven 
strategy can seamlessly interact with existing numerical integrators that are well- 
understood and reliably implemented in several legacy codes. In addition, it is suitable 
for problems where observational/experimental data must be fused with mathematical 
descriptions in a rigorous fashion and lead to improved analysis and prediction tools. 

The structure of the rest of the paper is as follows. In Section [5] we describe the 
proposed framework in relation with the state-of-the-art in dimensionality reduction. 
We provide details of the probabilistic model proposed in the context of Baycsian 
State-Space models in Section 12.21 Section 12.31 is devoted to inference and learning 
tasks which involve a locally-optimal Sequential Monte Carlo sampler and an online 
Expectation-Maximization scheme. The utilization of the coarse-grained dynamics in 
the context of a Bayesian (adaptive) time-integration scheme is discussed in 12.41 and 
numerical examples are provided in Section [3] 

2. Proposed Approach. 

2.1. Prom static-linear to dynamic-nonlinear dimensionality reduction. 

The inherent assumption of all multiscale analysis methods is the existence of a lower- 
dimensional parameterization of the original system with respect to which the dy- 
namical evolution is more tractable at the scales of interest. In some cases these slow 
variables can be identified a priori and the problem reduces to finding the necessary 
closures that will give rise to a consistent dynamical model. In general however one 
must identify the reduced space y as well as the dynamics within it. 

A prominent role in these efforts has been held by Principal Component Analysis 
(PCA) -based methods. With small differences and depending on the community 
other terms such as Proper Orthogonal Decomposition (POD) or Karhunen-Loeve 
expansion (KL), Empirical Orthogonal Functions (EOF) have also been used. PCA 
finds its roots in the early papers by Pearson jlllj and Hotelling [53] and was originally 
developed as a static dimensionality reduction technique. It is based on projections 
on a reduced basis identified by the leading eigenvectors of the covariance matrix 
C . In the dynamic case and in the absence of closed form expressions for the actual 
covariance matrix, samples of the process y t e W l at N distinct time instants i,; are 
used in order to obtain an estimate of the covariance matrix: 

1 N 

c k,c n = j- ^2{y ti - t^ivti - m) t (2-i) 

i=l 

where fi = -4 Vti is the empirical mean. If there is a spectral gap after the 
first k eigenvalues and V k is the d x K matrix whose columns are the K leading 
normalized eigenvectors of Cjv then the reduced-order model is defined with respect 
to y t = V KVt- The reduced dynamics can be identified by a Galerkin projection (or 
a Petrov-Galerkin projection) of the original ODEs in Equation (|l.ip : 

d 4r = V T K f(V T K y t ) (2.2) 




Fig. 2.1. The phase space is assumed two-dimensional for illustration purposes i.e. y t = 
(Vt^ ' Vt )• Each black circle corresponds to a realization y t . . V : y — > y is the projection operator 
from the original high-dimensional space y to the reduced-space y. 



Hence the reduced space y is approximated by a hyperplane in y and the projection 
mapping V linear (Figure 2.1(a) |. While it can be readily shown that the projection 
adopted is optimal in the mean square sense for stationary Gaussian processes, it is 
generally not so in cases where non-Gaussian processes or other distortion metrics are 
examined. The application of PCA-based techniques, to high-dimensional, multiscalc 
dynamical systems poses several modeling limitations. Firstly, the reduced space 
y might not be sufficiently approximated by a hyperplane of dimension K « d. 
Even though this assumption might hold locally, it is unlikely that this will be a 
good global approximation. Alternatively, the dynamics of the original process might 
be adequately approximated on AT-dimcnsional hyperplane but this hyperplane might 
change in time. Secondly, despite the fact that the projection on the subspace spanned 
by the leading eigenvectors captures most of the variance of the original process, in 
cases where this variability is due to the fast modes, there is no guarantee that y t will 
account for the long-range, slow dynamics which is of primary interest in multiscalc 
systems. Thirdly, the basic assumption in the estimation of the covariance matrix, is 
that the samples y t are drawn from the same distribution, i.e. that the process y t 
is stationary. A lot of multiscalc problems however involve non-stationary dynamics 
(e.g. non-equilibrium MD [78l|30]). Hence even if a stationary reduced-order process 
provides a good, local, approximation to y t , it might need to be updated in time. 
Apart from the aforementioned modeling issues, significant computational difficulties 
are encountered for high-dimensional systems (d = dim(y) » 1) and large datasets 
(N >> 1) as the K leading eigenvectors of large matrices (of dimension proportional 
to d or N) need to be evaluated. This effort must be repeated, if more samples become 
available (i.e. N increases) and an update of the reduced-order model is desirable. 
Recent efforts have concentrated on developing online versions j!37j that circumvent 
this problem. 

The obvious extension to the linear projections of PCA is nonlinear dimensionality 
reduction techniques. These have been the subject of intense research in statistics and 
machine learning in recent years ( [1211 11151 11301 l44l 11231 [TOl [9] ) and fairly recently 
have found their way to computational physics and multiscale dynamical systems (e.g. 
[32" 1 [96 l \W7\ [59]). They are generally based on calculating eigenvectors of an affinity 
matrix of a weighted graph. While they circumvent the limiting, linearity assumption 
of standard PCA, they still assume that the underlying process is stationary (Figure 



.1(b)). Even though the system's dynamics might be appropriately tracked on a 
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lower-dimensional subspace for a certain time period, this might not be invariant 
across the whole time range of interest. The identification of the dynamics on the 
reduced-space y is not as straightforward as in standard PCA and in most cases, 
a deterministic or stochastic model is fit directly to the projected data points [31[ 
[50l [58] . More importantly since the inverse mapping V~ x from the manifold y to 
y is not available analytically, approximations have to be made in order to find pre- 
images in the data-space ]11[ 150] . From a computational point of view, the cost of 
identifying the projection mapping is comparable to standard PCA as an eigenvalue 
problem on a N X N matrix has to be solved. Updating those eigenvalues and the 
nonlinear projection operator in cases where additional data become available implies 
a significant computational overhead although recent efforts [122] attempt to overcome 
this limitation. 

A common characteristic of the aforementioned techniques is that even though 
the reduced coordinates are learned from a finite amount of simulation data, there is 
no quantification of the uncertainty associated with these inferences. This is a critical 
component not only in cases where multiple sets of reduced parameters and coarse- 
grained models are consistent with the data, but also for assessing errors associated 
with the analysis and prediction estimates. It is one of the main motivations for 
adopting a probabilistic approach in this project. Statistical models can naturally 
deal with stochastic systems that frequently arise in a lot of applications. Most 
importantly perhaps, even in cases where the fine-scale model is deterministic (e.g. 
Equation ([l.ljl ). a stochastic reduced model provides a better approximation that can 
simultaneously quantify the uncertainty arising from the information loss that takes 
place during the coarse-graining process [S2J [!JSj ■ 

A more general perspective is offered by latent variable models where the observed 
data (experimental or computationally generated) is augmented by a set of hidden 
variables [13] . In the case of high- dimensional, multiscale dynamical systems, the la- 
tent model corresponds to a reduced-order process that evolves at scales of practical 
relevance. Complex distributions over the obscrvables can be expressed in terms of 
simpler and tractable joint distributions over the expanded variable space. Further- 
more, structural characteristics of the original, high-dimensional process y t can be 
revealed by interpreting the latent variables as generators of the observables. 

In that respect, a general setting is offered by Hidden Markov Models (HMM, 
[64] ) or more generally State-Space Models (SSM) [HI EH IHD]- These assume the 
existence of an unobserved (latent) process y t <E R K described by a (stochastic) ODE: 

dy " 

-—^-=f(y t ;Wt) (transition equation) (2-3) 
which gives rise to the obscrvables y t e K d as: 

y t = h(y t ,v t ) (emission equation) (2-4) 

where Wt and Vt are unknown stochastic processes (to be inferred from data) and 
/ : M. K — > M K , h : R K — > M. d are unknown measurable functions. The transition 
equation defines a prior distribution on the coarse-grained dynamics whereas the 
emission equation, the mapping that connects the reduced-order representation with 
the observable dynamics. The object of Bayesian inference is to learn the unobserved 
(unknown) model parameters from the observed data. Hence the coarse-grained model 
and its relation to the observable dynamics are inferred from the data. 
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The form of Equations (|2.3[) and (|2.4j) affords general representations. Linear and 
nonlinear PCA models arise as special cases by appropriate selection of the functions 
and random processes appearing in the transition and emission equations. Note for 
example that the transition equation (Equation (|2.3|) ) for y t in the case of the PCA- 
based models reviewed earlier is given by Equation (|2.2[) and the emission equation 
(Equation (|2.4|) ) that relates latent and observed processes is linear, deterministic and 
specified by the matrix of K leading eigenvectors V k ■ 

An extension to HMM is offerered by switching-state models [74] EH E21 1124] 
which can be thought of as dynamical mixture models EJJ EH ■ The latent dynamics 
consist of a discrete process that takes M values, each corresponding to a distinct 
dynamical behavior. This can be represented by an M -dimensional vector z t whose 
entries are zero except for a single one m which is equal to one and represents the 
active mode/cluster. Most commonly, the time-evolution of z t is modeled by a first- 
order stationary Markov process: 

z t+ i = Tz t (2.5) 

where T = [T mn ] is the transition matrix and T m%n = Pr[z m ^ t +i = 1 | z n ,t = 1]- In 
addition to z tl M processes 6 M. K , m = 1,...,M parameterize the reduced- 

order dynamics (see also discussion in section [2.2)1 . Each is activated when z m ^ = 1. 
In the linear version (Switching Linear Dynamic System, SLDS 0) and conditioned 

on z m .t = 1, the observables y t arise by a projection from the active as follows: 

y t = P (m )a> l) + v t , v t ~ N(0, S) (i.i.d) (2.6) 

where arc d x K matrices (K << d) and X is a positive definite d x d matrix. 

Such models provide a natural, physical interpretation according to which the behavior 
of the original process y t is segmented into M regimes or clusters, the dynamics of 
which can be low-dimensional and tractable. From a modeling point of view, the idea 
of utilizing a mixture of simple models provides great flexibility [TBI HH H25| , [70] as it 
can be theorized that given a large enough number of such components, any type of 
dynamics can be sufficiently approximated. In practice however, a large number might 
be needed, resulting in complex models containing a large number of parameters. 

Such mixture models have gained prominence in recent years in the machine 
learning community. In |15j for example, a dynamic mixture model was used to 
analyze a huge number of time series, each corresponding to a word in the English 
vocabulary as they appear in papers in the journal Science. The latent discrete 
variables represented topics and each topic implied a distribution on the space of 
words. As a result, not only a predictive summary (dimensionality reduction) of the 
high-dimensional observables was achieved but also an insightful deconstruction of the 
original time series was made possible. In fact current research in statistics has focused 
on infinite mixture models where the number of components can be automatically 
inferred from the data ( |129[ [2T [ fT2~ ] [56 ] [57] ) . In the context of computer simulations 
of high-dimensional systems, such models have been employed by [5SJ [551 [501 EH EI] 
where maximum likelihood techniques were used to learn the model parameters. 

In the next sections we present a novel model that generalizes SLDS. Unlike 
mixture models which assume that y t is the result of a single reduced-order pro- 
cess at a time, we propose a partial-membership model (referred to henceforth as 
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Partial-Membership Linear Dynamic System, PMLDS) which allows obscrvables to 
have fractional memberships in multiple clusters. The latent, building blocks are ex- 
perts [861 1871 1771 175) which, on their own, provide an incomplete, biased prediction 
but when their "opinions" are appropriately synthesized, they can give rise to a highly 
accurate aggregate model. 

From a modeling perspective such an approach has several appealing properties. 
The integrated coarse-grained model can be interpretablc and low-dimensional even 
for large, multiscale systems as its expressive ability does not hinge upon the complex- 
ity of the individual components but rather is a result of its factorial character f[67|V 
Intricate dynamical behavior can be captured and decomposed in terms of simple 
building blocks. It is highly-suited for problems that lack scale separation and where 
the evolution of the system is the result of phenomena at a cascade of scales. Each 
of these scales can be described by a latent process and the resulting coarse-grained 
model will account not only for the slow dynamics but also quantify the predictive 
uncertainty due to the condensed, fast-varying features. 

From an algorithmic point of view we present parallelizable, online inference/learning 
schemes, which can recursively update the estimates produced as more data become 
available i.e. if the time horizon t of the observables y 1 . t increases. Unlike some 
statistical applications where long time series are readily available, in the majority 
of problems involving computational simulations of high-dimensional, multiscale sys- 
tems, data is expensive (at least over large time horizons) as they imply calls to the 
microscopic solvers. The algorithms presented are capable of producing predictive 
estimates "on the fly" and if additional data is incorporated, they can readily update 
the model parameters. In addition, such schemes can take advantage of the natural 
tempering effect of introducing the data sequentially which can further facilitate the 
solution of the global estimation problem. More importantly perhaps, the updating 
schemes discussed have linear complexity with respect to the dimensionality d of the 
original process y t . 

2.2. Partial-Membership Linear Dynamic System. We present a hierar- 
chical Bayesian framework which promotes sparsity, intcrpretability and efficiency. 
The framework described can integrate heterogeneous building blocks and allows for 
physical insight to be introduced on a case-by-case basis. When dealing with high- 
dimensional molecular ensembles for example, each of these building blocks might 
be an (ovcrdamped) Langcvin equation with a harmonic potential [17) 180) 183) . It is 
obvious that such a simplified model would perhaps provide a good approximation 
under specific, limiting conditions (e.g. at a persistent metastable state) but definitely 
not across the whole time range of interest. Due to its simple parameterization and 
computational tractability, it can easily be trained to represent one of the "experts" 
in the integrated reduced-model. It is known that these models work well under spe- 
cific regimes but none of them gives an accurate global description. In the framework 
proposed, they can however be utilized in a way that combines their strengths, but 
also probabilistically quantifies their limitations. 

A transient, nonlinear PDE can be resolved into several linear PDEs whose simpler 
form and parameterization makes them computational tractable over macroscopic 
time scales and permits a coarser spatial discretization. Their combination with 
time-varying characteristics can give rise to an accurate global approximation. Their 
simplicity and limited range of applicability would preclude their individual use. In 
the framework proposed however, these simple models would only serve as good local 
approximants and their inaccurate predictions would be synthesized into an accurate, 



irder models 



9 




timcfStep 



Fig. 2.2. Realizations of two hidden 
(M = 2) one-dimensional (K = 1) Ornstcin- 
Uhlcnbcck processes were used dx^ 1 = 
-fe< m )(a;[ m) -g£"°)dfc+(£ (m) ) 1/2 dWt with 
(ftW.gi 15 ,^ 1 )) = (l.,3.,2.) (fast) and 
(6( 2 ),qi 2) ,S( 2 )) = (0.01, -3., 0.02) (slow) 



global model. 

2.2.1. Representations with simple building blocks. We describe a proba- 
bilistic, dynamic, continuous-time, generative model which relates a sequence the ob- 
servations y t £ M. d at discrete time instants t = 1, 2, ... t with a number of hidden pro- 
cesses. The proposed model consists of M hidden processes x| m ' £ M. K , m = 1, . . . , M 
(K « d) which are assumed to evolve independently of each other and are described 
by a set of (stochastic) ODEs: 

^l- =gm ^ ; e^), m=l,...,M (2.7) 

This equation essentially implies a prior distribution on the space of hidden pro- 
cesses parameterized by a set of (unknown a priori) parameters 6 X ■ It should be 
noted that while the proposed framework allows for any type of process in Equation 
(|2.7[) . it is desirable that these are simple, in the sense that the parameters 9^ are 
low-dimensional and can be learned swiftly and efficiently. We list some desirable 
properties of the prior models |125j : 

• Stationarity: Unless specific prior information is available, it would be unrea- 
sonable to impose a time bias on the evolution of any of the reduced dynamics 
processes. Hence it is important that the models adopted arc a priori station- 
ary. Note that the posterior distributions might still exhibit non-stationarity. 

• Correlation Decay: It is easily understood that for any m and t\,t2, the 
correlation x[™^ and x[™^ should decay monotonically as | £2 — t\ | goes 
to +00. This precludes models that do not explicitly account for the time 
evolution of the latent processes and assume that hidden states are not time- 
dependent (e.g. static PC A models). 

• Other: Although this is not necessary, we adopt a continuous time model in 
the sense of |136j with an analytically available transition density which allows 
inference to be carried out seamlessly even in cases where the observables are 
obtained at non-equidistant times. As a result the proposed framework can 
adapt to the granularity of the observables and also provide exact probabilistic 
predictions at any time resolution. 

Although more complex models can be adopted we assume here that independent, 
isotropic Ornstein-Uhlenbeck (OU) processes are used to model the hidden dynam- 
ics x^ . The OU processes used comply with the aforementioned desiderata. In 
particular, the following parameterization is employed: 

dx { t m) = -4 ro) (* t - q { x m) )dt + (S£ l) ) l '*dWi m) (2.8) 

where are Wiener processes (independent for each m), bx > 0, qi™^ £ M. K 

and are positive definite matrices of dimension K x K. The aforementioned 
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(e) observable p = 0.99 (f) histogram p = 0.99 

Fig. 2.3. The logistic normal distribution was used to model the weights associated with each of 
the hidden processes depicted in Figure l2~2l In particular, an isotropic Ornstcin-Uhlenbeck process 
dzt = —b z (zt — q z )dt + X?/ 2 dW t with b z = 0.001, q z = [0, 0] T and S 2 . Graphs depict the resulting 
observable time history (left column) and its histogram (right column) arising from the unobserved 
processes in Figure l2~2l and for three values of p. It is noted that time histories exhibit fast and slow 
scales of the processes in Figure l2~2l Furthermore, by changing a single parameter (i.e. p) one can 
obtain two, three or a single metastable state (right column - peaks of the histogram). 



model has a Gaussian invariant (stationary) distribution N{q^ > , ) . The 

transition density denoted by p(x[ m ^ | x)™\) for time separation St is also a Gaussian 
N(lJ. tt st, s st) where: 

x . = xt} - (1 - e- bi ™ )st )(x[™l - q { r ] ) 

(2.9) 



It is not expected that simple processes on their own will provide good approxima- 
tions to the essential dynamics exhibited in the data y t . In order to combine the dy- 
namics implied by the M processes in Equation (|2.7I) . we consider an M— dimensional 
process z t such that J2 m =i z m.t = 1 an d z m t > 0, Vt and define an appropriate 
prior. The coefficients z TOit express the weight or fractional membership to each pro- 
cess/expert a^"^ at time t j76j . Wc use the logistic-normal model |15j as a prior for 
z t . It is based on a Gaussian process, Zt whose dynamics are also prescribed by an 
isotropic OU process: 

dz t = -b z (z t -q z )dt + S 1 J 2 dW t (2.10) 

and the transformation: 

z m ,t = — m , Vm,t (2.11) 
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The invariant and transition densities of z t are obviously identical to the ones for x[ 
with appropriate substitution of the parameters. The hidden processes {ajj }m=i 
and associated weights z t give rise to the obscrvablcs y t as follows (compare with 
Equation (|2"1)])): 



Vt 



Zm,t P {m) x[ m) + v t , v t ~ N(0, E) (i.i.d) (2.12) 



where p( m -* are d x K matrices (JsT << d) and E is a positive definite dxrf matrix. 
The aforementioned equation implies a series of linear projections on hypcrplanes of 
dimension K . The dynamics along those hypcrplanes are dictated by a priori inde- 
pendent process x[ m \ It is noted however the reduced dynamics are simultaneously 
described by all the hidden processes (Figure |2T4|) . This is in contrast to PC A meth- 
ods where a single such projection is considered and mixture PCA models where even 
though several hidden processes are used, at each time instant it is assumed that a 
single one is active. Due to the factorial nature of the proposed model, multiple dy- 
namic regimes can be captured by appropriately combining a few latent states. While 



mixture models (Figure 2.1(c) I provide a flexible framework, the number of hidden 
states might be impractically large. As it is pointed out in |67j . in order to encode 
for example a time sequence with 30bits of information one would need k = 2 30 dis- 
tinct states. It is noted that even though linear projections are implied by Equation 
(12.121) and Gaussian noise v t is used, the resulting model for y t is nonlinear and non- 
Gaussian as it involves the factorial combination of M processes {x[ m ^}^ =1 with z t 
which arc a posteriori non-Gaussian. 

The parameters z m ,t express the relative importance of the various reduced models 
or equivalently the degree to which each data point y t is associated with each of the 

M reduced dynamics x^ . It is important to note that the proposed model allows for 
time varying weights z m t and can therefore account for the possibility of switching 
between different regimes of dynamics. Figure [2~3l depicts a simple example (d = 1) 
which illustrates the flexibility of the proposed approach. 

The unknown parameters of the coarse-grained model consist of: 

• dynamic variables denoted for notational economy by &t (i.e. {x[ z t 
, fort = 1,2,...). 

• static variables denoted by (i.e. 8 ( x m) = (bi m) , q { x m) , S { " l) ) in Equation 
§H1> , Z = (b Z7 q z ,S z ) in Equation (g3D) and {P (m) }£f =1 ,E in Equation 

2.3. Inference and learning. Inference in the probabilistic graphical model 
described involves determining the probability distributions associated with the un- 
observed (hidden) static and dynamic parameters ®t of the model. In the Bayesian 
setting adopted this is the posterior distribution of the unknown parameters of the 
coarse-grained model. Given the observations (computational or experimental) of 
the original, high-dimensional process y x . T = {y t }t=i> we denote the posterior by 
tt(0,0 1:t ): 

likelihood prior 



tt(0, 1:t ) = p{&, &i. r y UT ) = r (2.13) 

P(Vl:r) 
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cluster 1 




cluster 2 



(a) SLDS with two clusters 



cluster 1 




intermediate clusters 
cluster 2 



(b) proposed Partial-Membership LDS 



Fig. 2.4. Comparison of SLDS (a) with two mixture components and the proposed model 
partial-membership model (b) 



The normalization constant p(y 1 . T ) is not of interest when sampling for © or ©i :T 
but can be quite useful for model validation purposes. 

A telescopic decomposition holds for the likelihood which according to Equation 
(f2~T2"l) is given by: 



V 



(y 1:T \&,@ 1:T ) = l[p(y t \&,& t ) 



(2.14) 



where the densities in the product are described in Equation (|2.17[) . Equation (|2.12D 
defines the likelihood p(y t \ 0,©t) which is basically the weighted product of the 
likelihoods under each of the hidden processes/experts xf 1 ' : 

1 M 

p(y t | 0, t ) = J] PS?* (ft I ®> *) ( 2 - 15 ) 

^ 1 ' m— 1 

where the normalizing constant c(@, t ) ensures that the density intergrates to one 
with respect to y t . According to Equation (|2.12[) : 

p m (y t |0,0 t )oc ^L^expj-i^-pW^fE-^^-pWx^)} (2.16) 

The likelihood can be written in a more compact form as: 

P(Vt I ©> ©*) « j^TT? exp {-i(y t - ^^fS- 1 ^ - W t X t ) } (2.17) 



where: 



and: 



( ^f,(^f,... ( ,rr 



A/if xl 



= [ Zl,t P« Z 3 , t 

dxMK 



zm 



t P (M) ] 



(2.18) 



(2.19) 



The first-order Markovian processes adopted for the prior modeling of the dynamic 
parameters t (Equations (|2.8|) . (|2.10p . [23)) ) imply that: 



p(0 1:T1 0)=K0)IlK©t I ©t-i,©) 



(2.20) 
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where p(®i | ©o, ®) = p(©i | ©) = fo(®i I ®) is the prior on the initial condition 
which in this work is taken to be the stationary distribution of the underlying OU 
processes (see discussion in Section I2.2.1[) and denoted for notational economy by 
u (. | 0). 

The posterior encapsulates uncertainties arising from the potentially stochastic 
nature of the original process y t as well as due to the fact that a finite number of 
observations were used. The difficulty of the problem is that both the dynamic (®i :T ) 
and the static parameters (0) are unknown. We adopt a hybrid strategy whereby 
we sample from the full posterior for the dynamic parameters @t and provide point 
estimates for the static parameters 0. If uniform priors are used for then the 
procedure proposed reduces to a maximum likelihood estimation. Non-uniform priors 
have a regularization effect which can promote the identification of particular features. 

While the hybrid strategy proposed is common practice in pertinent models ([64]), 
in the current framework it is also necessitated by the difficulty in sampling in the high- 
dimensional state space of (note that the projection matrices P^ m ^ in particular are 
of the dimension of the observables d and d >> 1) as well as the need for scalability 
in the context of high-dimensional systems. The static parameters are estimated 
by maximizing the log-posterior. 

L(0) = log7r(© | y 1:T ) = log f tt(0, 1:t | y 1:T ) d® 1:T (2.21) 

" " — T' 

posterior Equation arm 



Maximization of L(@) is more complex than a standard optimization task as it 
involves integration over the unobserved dynamic variables ®i :r . While maximization 
can be accelerated by using gradient-based techniques (e.g. gradient ascent), the di- 
mensionality of makes such an approach impractical as it can be extremely difficult 
to scale the parameter increments. We propose therefore adopting an Expectation- 
Maximization framework (EM) which is an iterative, robust scheme that is guaranteed 
to increase the log-posterior at each iteration ( [411 164j ). It is based on constructing 
a series of increasing lower bounds of the log-posterior using auxiliary distributions 

<7(©l:r): 

L(0) = log tt(© I y 1:r ) = log / 7r(0, 1:r | y UT ) d® 1:T 

= logJ g (0 l:T ) 7r(e g^ 1 - ) d® 1:T 

> J q(@i; T ) log 7r ^ e ^ : i T ^ 1;T ' ) d&i :T (Jensen's inequality) 
= F(q,&) 

(2.22) 

It is obvious that this inequality becomes an equality when in place of the auxiliary 
distribution g(©i :T ) the posterior 7r(©i :r | 0,y 1:T ) is selected. Given an estimate 
©^ at step s, this suggests iterating between an Expectation step (E-step) whereby 
we average with respect to gW(0i :r ) = 7r(©i :T | ®( s ',y 1:T ) to evaluate the lower 
bound: 

E-step: fM( ? M,e) = J g«(e 1:T ) lo g 7r(0, © 1:r | y 1:T ) d® U r (0 „v 

-/gW(0 1:T )log g W(0 1:r )d0 1;T (Z - Z6) 



and a Maximization step (M-step) with respect to F( s \q( s \ 0) (and in particular the 
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first part in Equation (|2.23p since the second does no not depend on 0): 



M-step: (s+1) 



arg max© (q^ , 0) 
argmaxe£,(.)( 8l:T ) [log7r(0, i:T | y 1:T )] 
argmax©Q(0 (s) ,0) 



(2.24) 



As the optimal auxiliary distributions q^(®i- T ) = 7r(©i :T | ®( s ',y 1 . T ) are in- 
tractable, we propose employing a Sequential Monte Carlo (SMC or particle filter, 
[151137]) scheme for estimating the expectations in the M-Step, i.e. Equation (|2.24[) . 
SMC samplers provide a parallelizable framework for non-linear, non-Gaussian fil- 
tering problems whereby the target distribution <j( s )(0i :T ) = 7r(0i :T | ® <,! '\y 1 . T ) is 
represented with a population of N particles ®^.^ and weig hts W {s > l) such that the 
expectation in Equation (|2.24[) can be approximated by: 



E 9 M(e l!r ) [log7r(0,0 1:r | » 1:T )] « J2 WM log^©,©^) | y 1:T ) (2.25) 



In section 12.3.11 we discuss a particle filter that takes advantage of the particular 
structure of the posterior and employs the locally optimal importance sampling dis- 
tribution. Nevertheless, SMC samplers involve sequential importance sampling, and 
their performance decays with increasing r as the dimension of the state space 0i :r 
increases even when resampling and rejuvenation mechanisms are employed ([7j). Re- 
cent efforts based on exponential forgetting have shown that the accuracy of the 
approximation can be improved (while keeping the number of particles N fixed) by 
employing smoothing (|69j) over a fixed-lag in the past ([20]). 

In this paper we make use of an approximate but highly efficient alternative pro- 
posed in [6] [7j [8]. This is based on the so-called split-data likelihood (SDL) first 
discussed in |116j . which consists of splitting the observations into blocks (overlap- 
ping or non-overlapping) of length L < r and using the pseudo-likelihood which arises 
by assuming that these blocks are independent. It is shown in [7J that this leads to an 
alternative Kullback-Leibler divergence contrast function and under some regularity 
conditions that the set of parameters optimizing this contrast function includes the 
true parameter. Because the size of the blocks is fixed, the degeneracy of particle 
filters can be averted and the quality of the approximations can be further improved 
by applying a backward smoothing step over each block ([69]). Let k denote the index 
of the block of length L considered and y k = y ' (h-i)L-\-i-.kL an d ©fc = 0(fc-i)z,+i:k-L- 
If r = r L. The likelihood is approximated by: 



When 04 has reached a stationary regime with invariant density i>q(. | 0), then for 
any k, (©, 0^, y^) are identically distributed according to: 



/v 



r 



p(y 1:T |0,0 1:r )« l[ P (y k \®,@ k ) 



(2.26) 



P(® 7 & k7 y k ) 



7r(0)^o(©( fe _i) L | ®)p(V(k-i)L I 0(fc-i)L,0) 

nS-DL+iK®* I ®t-x,®) P (y t I ©*,©) 



(2.27) 



In a batch EM algorithm using the split-data likelihood and the k block of 
data, the M-step would involve maximization with respect to of (see also Equation 



Baycsian reduced-order models 



15 



Q(0( fc - 1 ),©) = j p(& k | ©^sy logj>(0,0 fe ,y fe ) d& k (2.28) 

We utilize an online EM algorithm where the iteration numbers s coincide with the 
block index k (i.e. s = k ) which effectively implies that the estimates for are 
updated every time a new data block is considered. The expectation step (E-step) is 
replaced by a stochastic approximation ( |119l[T§] ) while the M-step is left unchanged. 
In particular, at iteration k (= s): 

online E-step Q(0 (1:fc " 1) , 0) = (1 - 7/c )Q(0 (1:fe ~ 2) , ©) 

+j k Jp(G k | ©^yJlogpC©*,^) d@ k 

(2.29) 

and update the value of the parameters © as: 

online M-step & {k) = arg max© Q(0 (1:fc " 1) , 0) (2.30) 

The algorithm relies on a non- increasing sequence of positive stepsizes {7fc}fc>o such 
that J2 k Ik = +oo and ^ fe 7^ < +00. In this work we adopted j k = with a = 0.51. 
Naturally the integrals above over the hidden dynamic variables 0^ are estimated 
using SMC-based, particulate approximations of p(& k | @' fe_1 - ) y fe ). For small L the 
convergence will in general be slow as the split-block likelihood assumption will be 
further from the truth. For larger L, convergence is faster but the performance of 
the filter decays. For that purpose we also employed a backward smoothing filter 
over each block using the algorithm described in [69] . The computational cost of the 
smoothing algorithm is 0(N 2 L). 

In practice, and in particular for the exponential distributions utilized in the 
proposed model (e.g. Equation ()2.17|) ). the EM iterations reduce to calculating a 
set of (multivariate) sufficient statistics In particular, instead of the log-posterior 
lower bound Q(® i ' 1 ' k ~ 1 \ 0) in Equation (|2.29[) we update the sufficient statistics as 
follows: 

*<*> =(l- 7fe )* (fc ~ 1) f2311 
+ lk fp(@ k \@( k - 1 \y k )cf>(® k )d@ k { - ' 

where J p(& k \ 0^ _1 - ) , y k )<p(& k ) d& k denotes the set of sufficient statistics associ- 
ated with the block of data y = y( k -i)L+i-.kL- Specific details are provided in the 
Appendix. It is finally noted, that learning tasks in the context of the probabilistic 
model proposed, should also involve identifying the correct structure (e.g. the num- 
ber of different experts M). While this problem poses some very challenging issues 
which are currently the topic of active research in various contexts (e.g. nonpara- 
metric methods), this paper is exclusively concerned with parameter learning. In 
section |3l we discuss Baycsian validation techniques for assessing quantitatively the 
correct model structure which are computationally feasible due to the efficiency of the 
proposed algorithms. 

2.3.1. Locally optimal Sequential Monte Carlo samplers. In this section 
we discuss Monte Carlo approximations of the expectations appearing in Equation 
d23U) with respect to the density p(@ k | 0,y 1:L ) = p(®(k-i)L+i-.kL I ®,V(k-i)L+ukL)- 
Note that in order to simplify the notation we consider an arbitrary block of length L 
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(e.g. k = l)and do not explicitly indicate the iteration number of the EM algorithm. 
Hence the target density is: 

1 L 
p (&ul I e jVl:£ ) = p@y^o(©i I S)p(y 1 I ®i,®)[[p(®t I t -i,0)p(y t | t ,0) 

(2.32) 

where the dynamic variables are © t = (X t ,z t ) (Equation (|2.18jl ). Based on earlier 
discussions, the evolution dynamics of these variables are independent i.e.: 

14,(01 | 0) = u (Xx | ®)v ( Zl | 0) (2.33) 

and: 

p(® t | e t _i, ©) = p(X t I Xt-i, ®)p{z t I xt-i, ©) (2.34) 

Since there is a deterministic relation between z t and z t (Equation (|2.1ip ) we use 
them interchangeably In particular we use z t in the evolution equations since the 
initial and transition densities are Gaussian (Equation (|2.10[l ) and z t in the likelihood 
densities as the expressions simplify in Equation (|2.12jl ). The initial and transition 
densities for X t are also Gaussian. Given that x[ are a priori independent, we have 
that: 

p(X t \X t -i,B) =U^P(4 m) \4-l®) f2 35) 

= Af(X t \^S x ) {2 - 6b) 

where the mean /x t = fj, t (Xt-i) is given by Equation (|2.9p and Sx = diag(S Xi i, . . . , S Xj m) 
(from Equation (|2 .9[) as well). 

SMC samplers operate on a sequence of target densities p(®ut \ Vi-t, 0) which, 
for any t, are approximated by a set of n random samples (or particles) {0^)}™ =1 . 
These are propagated forward in time using a combination of importance sampling, 
resampling and MCMC-based rejuvenation mechanisms [33 [37l I138[ 1 133J . Each of 
these particles is associated with an importance weight (X^ILi = 1) which is 
updated sequentially along with the particle locations in order to provide a particulate 
approximation: n 

p(©i :t | y 1:t , 0) « W® 6 B m (0 1:f ) (2.36) 

where (.) is the Dirac function centered at ®i) t . Furthermore, for any measurable 
<t>(@i:t) (as in Equation (|2~3T]l ) and Vt j36l [26| [20] : 

3(0,0i:t) -> / 0(0i:t) p(0i:t | V l!t ,e)dei:t (almost surely) (2.37) 
i=i ^ 

The particles are constructed recursively in time using a sequence of importance 
sampling densities qt(®t | ®t-i, Dti ®)- The importance weights are determined from 
the fact that: 

P(Ql:t I 2/l:t, 0) = P(©l:t-1 I 2/l:t-l, ©J 7" j~" 7^ (2-38) 

PVJ/« I i/l:t-l) "J 

Let {W^, 0i l )_i}ili the particulate approximation of p(©i : *_i | yi : j_i,0)- Note 
that for t = 1 and for the Gaussian initial densities v$ of the proposed model, this 
consists of exact draws and weig hts W® = i. At time t we proceed as follows ( [45] ) : 
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1. Sample t (i) ~ q t (@ t \ &fl 1: y t ,&),^i and set ©g <- ©t ) 

2. Compute incremental weights: 



ft (eW 1®^,©) 



(2.39) 



and update the weights: 



W« = — f '—n (2.40) 

3. Compute ES 5 = ^ w ^)y an d if < ESS m in perform multino- 

mial resampling to obtain a new population with equally weighted particles 
{ESSmin = N/2 was used in this study). Set t t + 1 and go to step 1. 
It can be easily established ([IS]) that the locally optimal importance sampling 
density is: 

9t (fc> t | fa» t -i,!/t,fa>) - — — — (2.41) 

J p(0 t | © t _i, 0>(y t | & t , ©) d©* 

In practice, it is usually impossible to sample from q° pt and/or calculate the integral 
in the denominator. As a result, approximations are used which nevertheless result in 
non-zero variance in the estimators. In this paper we take advantage of the fact that 
the transition density of X t as well as the likelihood, conditioned on z t are Gaussians 
and propose an importance sampling density of the form: 

rv - l v - n\ (-\- p(X t \X t _ 1 ,®)p(y t \X t ,z t ,®) 

q t (X t ,z t | X t _i,z t _i,2/ t ,0) =p(z t | Zi_i,0)- 



fp(X t | X t _ 1 ,@)p(y t | X t ,z u &) dX t 

(2.42) 

This implies using the prior to draw z t and the locally optimal distribution (condi- 
tioned on Zt or equivalcntly Zt) for X t - The latter will also be a Gaussian whose 
mean /2 t and covariancc Sx can be readily be established (e.g. using Kalman filter 
formulas): 



h = (s x l + w^- l w t 



(2.43) 

fL t = s x (s^Vt + wj s- 1 ^) 

As a result the incremental weights u t are given by: 

ut =| S x \ 1/2 exp&fi&jMt - ^fS^Vj (2-44) 



2.4. Prediction and Bayesian adaptive time-integration. Bayesian infer- 
ence results do not include just point estimates but rather samples from the posterior 
density, at least with repsect to the time- varying parameters ©t . The inferred poste- 
rior can be readily used to make probabilistic predictions about the future evolution of 
the high-dimensional, multiscalc process y t . Given observations y 1 . T = {y t }1=ii the 
predictive posterior for the future state of the system y T , 1:r , T over a time horizon T 
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can be expressed as: 




PiUr+l-.T+T I ©,©r + l:r+T) p(&, t +1:t+T | Vl; T )d@ d& T + V . r+T 



r+l-.T+T- 



0, T+ l :r+T | Vx-.r) d&d& T+1:T+T 




prior Eauation i 2. 20l posterior Equation 



The integral above can be approximated using Monte Carlo. In particular given 
the particulate approximation of the posterior p(®, 0i :T | 2/i- T ) (which consists of 
samples of the dynamic variables ©t and the MAP estimate of 0), samples from 
the prior p(& t +1:t+t | ©t, 0) and subsequently the likelihood p(® T +i :T +T I ®t, 0) 
can readily be drawn. In fact, given that the latter is a multivariate Gaussian, the 
predictive posterior will consist of a mixture of Gaussians, one for each sample of 
0t+1:t+t drawn. 

The important consequence of the Bayesian framework advocated is that precitive 
estimates are not restricted to point estimates but whole distributions which can read- 
ily quantify the predicitve uncertainty. This naturally gives rise to Bayesian, adaptive, 
time-integration scheme that allows bridging across timescales while providing quanti- 
tative, probabilistic estimates of the accuracy of the coarse-grained dynamics (Figure 
I2.5[) . The distribution of Equation (|2.45p is used to probabilistically predict the evo- 
lution of the system. The time range over which the reduced model is employed does 
not have to be specified a priori but can be automatically determined by the variance 
of the predictive posterior (Figure [2~5)l . Once this exceeds the allowable tolerance 
specified by the analyst, the fine-scale process is reinitialized and more data are ob- 
tained, that can in turn be used to update the coarse-grained model. It is emphasized 
that due to the generative character of the model proposed, the reinitialization can be 
performed consistently based in general on the emission equations Equation (|2.12[) . 
In contrast to existing techniques such as projective and coarse-projective integration 
[Uj] E>S [Ul [UUJ [UUJ EH as well as Heterogeneous Multiscale Methods gSJ HUT], there 
is no need to prescribe lifting and restriction operators and no ambiguity exists with 
regards to the appropriateness of the reinitialization scheme. Furthermore, the prob- 
abilistic coarse-grained model provides quantitative estimates for its predictive ability 
and automatically identifies the need for more information from the fine-scale model. 

3. Numerical experiments. The first examples is intended to validate the 
accuracy of the proposed online EM scheme and utilizes a synthetic dataset. The sec- 
ond example uses actual data and illustrates the superiority of the proposed PMLDS 
model over existing SLDS models. Finally the third example provides an application 
in the context of multiscale simulations for the time-dependent diffusion equation. 

3.1. Synthetic data. We generated data from the proposed model in order to 
investigate the ability of the inference and learning algorithms discussed. In particular, 
we considered a mixture of two M = 2, onc-dimcnsional OU processes (K = 1) as in 
Equation $LE\) with (&«, q { *\ = (0.1,-5.0,0.2) (slow) and (b^ 2 \ qi 2) , S (2) ) = 

(1., +5.0, 2.0) (fast). The logistic normal distribution was used to model the weights 
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Fig. 2.5. Baycsian adaptive time-integration and data-model fusion illustrated for a two- 
dimensional flow. The data generated from computational simulations j/i :r and/or experiments 
y 1:T / are sequentially incorporated in the Bayesian model and the posterior p(&,&i :T \ yi :T ) over 
dynamic and static parameters is updated. The predictive posterior p(j/ T ^ 1 . T+T | Vi. T ) is over 
the time horizon T used to efficiently produce probabilistic predictions of the evolution of the high- 
dimensional process y t in the future. When the uncertainty associated with those predictions exceeds 
the analysts' tolerance, the original system is consistently reinitialized and more data are generated. 
These are used to update the (predictive) posterior and to produce additional predictive estimates, 
ft is noted that the tolerance in the predictive uncertainty can also be measured with respect to 
(low-dimensional) obscrvablcs which arc usually of interest in practical applications. 



associated with each of the hidden processes using an isotropic Ornstein-Uhlcnbcck 
process dz t = -b z (z t - q z )dt + T,l /2 dW t with b z = 1.0, q z = [0, 0] T and H z = 

Two 10 x 1 projection vectors P m ,m =1,2 were generated from the 



10. 

prior 7V(0, 100/) (sec Appendix) and (d 



10) time series y t were produced based on 



Equation (|2.12j) with idiosyncratic variances X = 0.1 2 J and time step St = 1. The 
resulting time series exhibit multimodal, non-Gaussian densities as can be seen in 



Figure 3.1(a) as well as two distinct time scales as it can be seen in the autocovariances 



plotted in Figure 3.1(b) 



Figure 13.21 depicts the convergence of the proposed online EM scheme to the 
reference values of b^ n \m = 1,2, for various block sizes L and particle populations 
N. Figure 13.31 depicts the evolution of the log-likelihood per iteration of the EM 
algorithm. Figure 3.4(a) depicts the normalized error in the identified P (m , m = 1, 2 



and isiosyncratic variances S pre coordinate after 20,000 iterations. In all cases the 
algorithm exhibits good convergence to the reference values. 

3.2. Temperature Dataset. The goal of this numerical experiment is to illus- 
trate the interpretability of the proposed model and compare with the switching-state 
linear model discussed in section (Equation (|2.6p ). For that purpose we utilized the 
temperature data (in degrees Fahrenheit) of the capitals of the 50 states in the U.S. A 
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(a) density 



(b) autocovariancc 



Fig. 3.1. Densities (a)) and Autocovariances (b)) of times series y t 1 (black), y t 2 (red) and 
y t 4 (blue) (solid lines). With (— o — ) the densities and autocovariances of the same times scries 
generated using the learned model parameters using the proposed online EM scheme with L = 200 
and N = 200 (see Figures [3721 and \SAl 



Oe+OO 2e+04 4e+04, . , 6e+04 8e+04 le+05 

time t 

Fig. 3.2. Convergence of b^ (black) and b^ (red) using the online EM algorithm for three 
different combinations of L and N. (— o —) corresponds to L = 100, N = 100, (—()—) to L = 200, 
N = 200 and (-A-) to L = 20, N = 1000 



(d = 50). The data was obtained from http://www.engr.udayton.edu/weather/citylistUS.htm 
and it represents the average daily temperatures from 01/01/1996 until 01/13/2009 
(i.e. 5, 127 daily observations). 

Figure 13.51 depicts the posterior memberships corresponding to the SLDS and 
PMLDS models based on a reduced model with two hidden states (M = 2) described 
by one-dimensional OU processes (K = 1). The former assumes that at each time 
instant the observables y t arises from a single hidden process. Hence a single entry 
of Zt = [zi.t,Z2,t, ■ ■ ■ ,zM,t] is equal to 1 and the rest are all equal to 0. The top 
part of Figure 15751 shows the posterior mean of z m j,m = 1,2. As one would expect 
the two-states correspond to cold-winter (blue) and hot-summer (red) and alternate 
periodically (roughly the cold-winter state is active between early November until 
mid- April and the hot-summer state in the remainder of the calendar year). The top 
part of Figure [XT] depicts the corresponding P^ m \ m = 1,2 where southern states have 
higher values and northern states lower. Naturally, winter and summer represent the 
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Fig. 3.3. Log-likelihood convergence using the online EM algorithm for three different combi- 
nations of L and N. (- o -) corresponds to L = 100, N = 100, (-Q-) to L = 200, N = 200 and 
(-A-) to L = 20, N = 1000 
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Fig. 3.4. Normalized absolute errors (per coordinate) on the identification of the projection 

V6CtOI*S fYl — ^ *^ Q n iclinoi rm /irotif irari dti^oc <-r^ 

scheme with L = 200 and N = 200 



1 , 2 and idiosyncratic variances crt , j = 1 , . . . , d using the proposed online EM 



two extremes but several intermediate states are also present. The proposed partial- 
membership model can account for those states without increasing the cardinality of 
the reduced-order dynamics. As it can be seen in the bottom part of Figure 13. 51 which 
depicts the particulate approximation of the posterior of z m ,t, m = 1,2, the two hidden 
states can also be attributed to the two extremes but the actual temperatures arise 
by a weighted combination of these two. Naturally during spring-summer months the 
weight of the "red" state is higher and during autumn-winter months the weight of 



the "blue" state takes over, 
depicted in Figure [ 



The posterior of the hidden processes x 



(m) 



1,2 



In order to quantitatively compare the two models we calculated the average, 
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Fig. 3.5. (Top) Posterior mean of z m t,ni = 1,2 based on the SLDS and (Bottom) particulate 
approximation of the posterior of z mi t,m = 1,2 PMLDS. Both results were obtained using the 
previously discussed online EM scheme with L = 200 and N = 100 



M 


K 


SLDS 


PMLDS 


2 


1 


-179.97 ±37.31 


-171.11 ±37.20 


2 


2 


-170.68 ±36.95 


-141.11 ±27.82 


4 


1 


-176.40 ±34.36 


-143.81 ±25.56 


4 


2 


-166.05 ±30.57 


-117.67±21.15 



Table 3.1 



One-step-ahead predictive log-likelihood (Equation i3. IV ) of SLDS and PMLDS models for var- 
ious M, K. The table reports the average value plus/minus one standard deviation in bits. All 
results were obtained using the previously discussed online EM scheme with L = 200 and N = 100 



one-step-ahead, predictive log-likelihood logp(y t+1 | yl : i),Vt G [0, T): 

P(Vt+l I Vl:t) = J lo EP{y t +l I ©, ©t+lWl:i) P{®,@t+1 I y 1:t )d@d& t+1 

= J iogp(y t+ i I ®,®t+iyut) 

p(& t+1 I t , 0) p(e, ® t I y vt ) d® d® t:t+1 (3.1) 

^ / s / 

prior posterior 

The latter integral is approximated by Monte Carlo using the MAP estimate of 
and the particulate approximation of the posterior for the dynamic variables ®t. This 
provides a measure of how well the model generalizes to a novel observation from the 
same distribution as the training data and higher values imply a better model. Table 
13.11 reports the average values (in bits) plus/minus the standard deviation (over t G 
(100, T = 5127) ). Similar calculations were carried out for other model cardinalities 
(i.e. M, K) and in all cases the proposed model exhibited superior performance. This 
superiority becomes more pronounced as M and K increased which can be attributed 
to the factorial character of PMLDS. 
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Fig. 3.6. Particulate approximation of the posterior of x^' , m = 1,2. Results were obtained 




Fig. 3.7. Maximum posterior estimates of p( m ) G R 50 ,m = 1,2 based on the SLDS(top) 
and PMLDS (bottom) models. Both results were obtained using the previously discussed online EM 
scheme with L = 200 and N = 100 



3.3. Transient Heat equation. We finally apply the proposed analysis scheme 
to the one-dimensional transient heat equation: 

W = 35 (°(*)&) > f 32 ) 
u(0,t)=u(M) = 0, Vi 1 ; 



The spatial domain was discretized with 1, 000 finite elements of equal length and we 
considered a "rough" conductivity profile shown in Figure 3.8(a) The conductivity 
a{x) in each finite element was assumed constant and its value was drawn indepen- 
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dcntly from a uniform distribution 0. For x G [0, 0.5] wc used the uniform [/[0.01, 0.1] 
and for x G (0.5,1], {/[0. 51, 0.6]. This naturally resulted in the jump observed in 



Figure 3.8(a) which as a consequence gave rise to two distinct slow time scales in 
the solution profile u(x,t) depicted in Figure 
ditions was also used (as can be seen Figure 



3.8(b) A rough profile of initial con- 



3.8(b) t = 0). In particular at each 



node Xi = O.OOli, i = 1,...,1001 we set u(xi,0) = lOx^l - Xi)(l + O.IZi) where 
Zi~JV(0,l) (i.i.d). 

Upon spatial discretization, we obtain a coupled system of ODEs: 

y t + Ky t =0 (3.3) 

where y t G R 1001 represents the solution at the nodes Xi, i.e. y t = [u(xi,t), u(x2,t), . . . , 
. . . , u(xd, t)] T . In contrast to existing approaches for the same diffusion equation (e.g. 
[5J [TJ 1118] ) we do not exploit mathematical properties of the PDE in specifying the 
coarse-grained model but rely on data. This data is obtained upon temporal dis- 
cretization of Equation (|3.3[) where a time step St = 0.0001 was used. As a result at 
each time step we obtained a vector of observables y t of dimension d = 1001. This 
data was incorporated in the Bayesian model proposed using two hidden OU process 
(M = 2) of dimension K = 2 each. In particular we employed the online EM scheme 
previously discussed over blocks of length L = 10 time steps and N = 100 particles. 
In particular (see also Figure [23]) : 

• data over 20 times steps St, y 1 . 2 o (i-e. corresponding to total time 20St = 
0.002) were ingested by the Bayesian reduced model, and 

• the latter was used to predict the evolution of the system over 500 time steps 
(i.e. total time T = 500St = 0.05). 

• The original solver of the governing PDE was then re-initialized using the 
posterior mean estimate of the state of the system y 52 o an d was run for further 
20 time steps. Using the additional data obtained 2/521:540; the Bayesian 
model was updated and the process described was repeated. 

It is noted that the proposed Bayesian prediction scheme results in a reduction of the 
number of fine scale integration time steps by a factor of 25 (T/20St = 0.05/0.002) 
leading to a significant acceleration of the simulation process. Figure [3791 depicts the 
posterior estimates of the solution at various time instants. In all cases these ap- 
proximate very accurately the exact solution and these estimates improve as more 
are accumulated. One of the main advantages of the proposed approach is that apart 
from single-point estimates one can readily obtain credible intervals that quantify pre- 
dictive uncertainties due to information loss by the use of the reduced-order dynamic 
model and the finite amount of data used to learn that model. As it is seen in Figure 
I3.9l these envelop the exact solution and become tighter at larger times. As one would 
expect, when a larger predictive horizon T = 0.1 (i.e. 1,000 time steps St) is used, 
as it can be seen in Figures 13.101 and 13.111 the predictive uncertainty grows. Such a 
scheme however is twice as efficient leading to a reduction of computational effort by 
a factor of 50 (i.e. T/205t = 0.1/0.002). Hence if the analyst is willing to tolerate 
the additional uncertainty, efficiency gains can be achieved. This supports the argu- 
ments made previously with regards to an adaptive Bayesian scheme where, the level 
of predictive uncertainty would be specified and the algorithm would automatically 
revert to the fine scale model in order to obtain more data and improve the predictive 
estimates. 



2 We considered a single realization of the conductivity profile and solved for it as a deterministic 
problem. The stochastic PDE where a(x) is random is not considered here. 
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(a) conductivity 



(b) Exact solution profiles u(x, t) at various t 



Fig. 3.8. Dynamic heat equation. Spatial discretization with 1,000 finite elements. Time 
discretization using St = 1 X 10 — 4 . 




(a) t = 0.15 
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Fig. 3.9. Comparison of predictive posterior estimates (posterior mean and 5% and 95% quan- 
tilcs) with exact solution u(x,t) at various t 



4. Conclusions. The proposed modeling framework can extract interpretable 
reduced representations of high-dimensional systems by employing simple, low-dimensional 
processes. It simultaneously achieves dimensionality reduction and learning of reduced 
dynamics. The Bayesian framework adopted provides a generalization over single- 
point estimates obtained through maximum-likelihood procedures. It can quantify 
uncertainties associated with learning from finite amounts of data and produce prob- 
abilistic predictive estimates. The latter can be used to rigorously perform concurrent 
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0.2 0.4 „, 0.6 0.8 1 0.2 0.4 ™ 0.6 0.8 1 



(a) t = 0.25 (b) t = 1.0 

Fig. 3.10. Comparison of predictive posterior estimates (posterior mean and 5% and 95% 
quantiles) with exact solution u(x,t) at various t. These results were obtained with a prediction 
horizon T = 0.1 (St = 0.0001) in contrast to Figure I3~9l which were obtained for T = 0.05. 




(a) T = 0.05 (b) T = 0.1 



Fig. 3.11. Comparison of predictive posterior estimates (posterior mean and 5% and 95% 
quantiles) with exact solution u(x, t) at t = 1.0. These results were obtained with a prediction 
horizon (a) T = 0.05 and (b) T = 0.1. 

simulations with the microscopic model without the need of prescribing ad hoc up- 
scaling and downscaling operators. 

Critical to the efficacy of the proposed techniques is scalability particularly with 
regards to the large dimension d of the original process. The algorithms proposed 
imply 0{d) order of operations. Furthermore they can dynamically update the coarse- 
grained models as more data become available. In a typical scenario, the fine-scale 
model is reinitialized several times in order to obtain additional information about 
the system's evolution that is incorporated in the coarse-grained dynamics "on the 

The Bayesian, statistical perspective can readily be extended to the modeling 
stochastic dynamical systems. This would require generating more than one realiza- 
tions of the original dynamics which can nevertheless be incorporated in the coarse- 
grained models using the same online EM scheme. In fact the loss of information that 
unavoidably takes place during the coarse-graining, results in probabilistic reduced- 
order models even if the original model was deterministic. A critical question that 
offers opportunity for future research on the topic relates to structural learning and 
in particularly with the dimensionality of the representation, i.e. the number of hid- 
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den processes xf 1 ' needed (denoted by M in Equation ([HT])). Treating this as a 
model selection problem as it was done in the examples, assumes that there is a sin- 
gle, optimal finite-dimensional representation. Current research activities are centered 
around nonparametric Baycsian priors over infinite combinatorial structures based on 
the Dirichlet process paradigm and infinite latent features models (e.g. |1 29( 170 1 [75]). 
These offer an alternative perspective by assuming that the number of building blocks 
is potentially unbounded, and that the observables only manifest a sparse subset of 
those. As a result, the cardinality of the coarse-grained model can be automatically 
determined from the data. Another aspect that warrants further investigation is 
prior modeling of the static parameters. Apart from the regularization effect this 
offers, it can promote the discovery of desirable features, such as slow-varying essen- 
tial dynamics, sparse factors (e.g. _p( m ) in Equation (|2.12[1 ) which can advance the 
intcrprctability of the results and facilitate the inference process. 
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Appendix. This appendix discusses the sufficient statistics and update equations 
for the static parameters used in the probabilistic model proposed. In the first 
section we discuss parameters appearing in the reduced-order dynamics models and 
in the second those appearing in the likelihood. 

Sufficient statistics for parameters appearing in the prior. As discussed 
in section [2. 2 .11 independent, isotropic OU processes are used as prior models for the 
latent, coarse-grained dynamics x[ m ' G R K as well the process z t e R M that models 
the frctional memberships to to each process m. We therefore discuss the essential 
elements for the online EM computations described in section |2~31 (0 [3 IB]) for a 
general isotropic OU process in l n of the form: 

dx t = -bfa - q)dt. + S 1/2 dW t (4.1) 

It is of interest to determine the parameters 9 = (b,q,S). Let also ir(6) denote 
the prior on 6. The readers can adjust the expressions below to any x\ or z t 
since independent priors were used. Note that the stationary distribution of x t is a 
Gaussian: 

vo(x)=tf(x\q t C=±S) (4.2) 

and the transition density p{x t \ x t -\) assuming that equidistant time instants with 
time step St are considered, is given by: 

p(x t | x t -i) = N{x t | /j, st (x t -i), Sg t ) (4.3) 

where: 

/i Jt (x t _i) = *t_i - (1 - e- m )(x t ^ - q) (4.4) 

and: 

1 _ -2bSt 

Sst = 5 (45) 

Given a block of length L with observables y 1 . L and according to Equations (|2.27l) 
and (|2.28[) we have that: 

6(0^,6) ={-\\og\C\-\(x 1 -q) T C- 1 (x 1 -q) 

+ Ef=2 "I log I S st I -i(x t - q) T C- 1 (x 1 - q)) 
+ \ogTr(6) 

where the brackets < . > imply expectation with respect to p(xi-.L | 6 iVul) as 
in Equation (|2.28[) . In order to maximize Q(®^ uk ~ 1 ' , 0) as in Equation (|2.30|) one 

needs to solve the system of equations arising from — — — ^ — — = These equations 
equations with respect to 6 are solved with fixed point iterations. They depend on 
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the following 7 sufficient statistics <& = {$ 3 }^ =1 : 
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l x i- 
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$6 




- x t 




$7 




- x t 


-i)(xt 



(4.6) 



Sufficient statistics for parameters appearing in the likelihood. The 

process a bit more involved in the case of the parameters appearing in the likelihood 
Equation (|2.14[) i.e. the projection matrices {P }£f=i 01 dimension d x K and 
the covariance S which is a (positive definite) matrix of d x d. In order to retain 
scalability in high-dimensional problems (i.e. d » 1) we assume a diagonal form of 
X = diag(al, a%, . . . , c 2 .) which implies learning d parameters rather than d(d + l)/2. 

Denoting now by 6 — ({P^}m=ii { a j}j=i) > ""W tne prior and according to 
Equations (|2~27)l and (pT28)) we have that: 



Q(0 (fe_1) ,0) = (Ef =1 -| log | s | -i(y t - WtXt) 2 ^- 1 ^ - w t x t ; 



+ log7r(0) 

Differentiation with respect to p( m ) reveals that the stationary point must satisfy 



(4.7) 



j^{m) _ p(n) £j(n,m) 



n=l 



where the sufficient statistics are: 

/ l 



1,2,...,M 



IxJf 



I t=l 



and: 




E(n)/ (ro)sT 
Zt,nZt,mXt \Xt ) 



(4.8) 



(4.9) 



(4.10) 



/=i 



In the absence of a prior tt(9) and if P^ m - ) and Aj represent the j th rows (j 
1, . . . , rf) of the matrices P' m ' and A*-" 1 -* respectively, then Equation (|4.9[) implies: 



,(2) 



Aj:(lxiif M) 



,(2) 



>(M) 



Pj-.(lxK M) 
B (l,l) B (l,2) 

B (2,l) B (2,2) 



B 



(M,l) 



(M,2) 



B-.(MKxMK) 



B (2,M) 



B 



(M,M) 



(4.11) 
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This leads to the following update equations for P™ , Vj,m: 

Pj = AjB- 1 (4.12) 

Note that the matrix B to be inverted is independent of the dimension of the 
observables d (d » 1) and the inversion needs to be carried out once for all j = 
1, . . . ,d. Hence the scaling of the update equations for p( m ^ is 0(d) i.e. linear with 
respect to the dimensionality of the original system. 

Furthermore, in the absence of a prior tt(0), differentiation with respect to a J 2 
(j = 1, . . . , d) leads to the following update equation: 

Lcr] =Ylf =1 yl j -2A j Pf + P j BPj (4.13) 

In summary the sufficient statistics needed are the ones in Equations (|4. 9|) and (|4.12l) . 

In the numerical examples in this paper a diffuse Gaussian prior was used for _p( m ) 
with variance 100 for each of the entries of the matrix. This leads to the addition of 
the term 1/100 in the diagonal elements of the B in Equation (|4.12p . No priors were 
used for a 2 . 
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